Student Information

First Name:

Last Name:

Student ID:

Aalto E-mail:


Assignment 3: Programming Part

For this part, you'll be using PySpark to process tweets, i.e., messages generated on twitter.

This Jupyter notebook contains: (i) instructions to setup the programming environment, and (ii) the programming problems. You will use this notebook to write your code and submit it for grading.

Setup Instructions

To run your code, you will need to connect to a Jupyter server maintained by CSC. Please read the following instructions very carefully.

Main Steps

  • Open your browser and go to server address https://86.50.170.119.
  • If you are asked for credentials, provide username: student and password: moderndb.
  • Once you're in the Jupyter "Home" page, upload your copy of the notebook by clicking Upload on the top-right corner.
  • You will now see the notebook listed in "Home". Click to open it on a new tab and work on the programming problems.
  • Once you have completed all or part of your work, make sure you download the notebook to your computer. You do that by selecting File > Dowload as > IPython Notebook. At the end of your session (after you download your notebook), select File > Close and Halt.
  • Submit the notebook that contains your final solutions (code and output) to mycourses.

Saving Your Work

The server will restart after 15 minutes of inaction.

Therefore, if you have made changes to the notebook but want to pause your work, make sure you download the notebook to your computer. You can upload it again when you're ready to resume your work.

What to do if the server restarts while inactive To save the code you wrote, copy and paste it manually to a text file on your computer. You can then follow again the main steps described above.

Small and big dataset

You will work with two datasets:

  • A small dataset. This dataset is already available. Its purpose is for you to work with it to develop solutions for the assignment.
  • A big dataset. This dataset will be available on Friday April 1st, 2016, together with instructions on how to access it. You will use it to produce your final solutions for the assignment.

Server Availability

When working with the big dataset, you will be using a bigger cluster than the one you'll be using when working with the small dataset. To see how many other students are using it and if there are available resources, visit this webpage: https://86.50.170.119/resources.

If you receive a message that says that the server is full, allow 15-20 minutes before you try to access it again.

Attention! You must develop and run your code before the day of the deadline. We cannot guarantee support for any failures that happen on that day.

PySpark Setup

Run the following cells once before you start working on your solutions.

Note: If you attempt to create a SparkContext twice, you will get an error.


In [ ]:
## Imports and SparkContext

import pyspark
import numpy as np # use numpy for advanced numeric operations
import ast # we'll use this module to read data

In [ ]:
## SMALL dataset

## Run this cell *ONLY* if you want to be
## using the SMALL dataset

SMALL_FILE = "file:///data/small.input"

DATA_FILE = SMALL_FILE

In [ ]:
## BIG dataset

## Run this cell *ONLY* if you want to be
## using the BIG dataset

BIG_FILE = "hdfs:///moderndb/input_2.5m.input"

## Environment parameters for the big cluster
## Use them ONLY if you're working with the big dataset
import os
os.environ["PYSPARK_PYTHON"]="/opt/conda/bin/python3"
os.environ["SPARK_HOME"]="/usr/hdp/current/spark-client"
os.environ["HDP_VERSION"]="current"

DATA_FILE = BIG_FILE

In [ ]:
sc = pyspark.SparkContext()
load_func = sc.textFile

Tweets

Run the following cell to assign the data to an RDD.


In [ ]:
data = load_func(DATA_FILE)

Most lines in the file contain a string representation of a python dictionary that represents a tweet. A tweet is a message that has been generated on twitter. Below you see the example of one tweet (one line in the file).

{u'contributors': None, u'coordinates': None, u'created_at': u'Fri Sep 26 08:01:38 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [{u'display_url': u'bbc.in/1qBcZ4G', u'expanded_url': u'http://bbc.in/1qBcZ4G', u'indices': [126, 140], u'url': u'http://t.co/8kPs90syqo'}], u'user_mentions': [{u'id': 2190056023, u'id_str': u'2190056023', u'indices': [3, 9], u'name': u'BBC Outside Source', u'screen_name': u'BBCOS'}]}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': None, u'id': 515410856616935425, u'id_str': u'515410856616935425', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'place': None, u'possibly_sensitive': False, u'retweet_count': 0, u'retweeted': False, u'retweeted_status': {u'contributors': None, u'coordinates': None, u'created_at': u'Fri Sep 26 07:44:43 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [{u'display_url': u'bbc.in/1qBcZ4G', u'expanded_url': u'http://bbc.in/1qBcZ4G', u'indices': [115, 137], u'url': u'http://t.co/8kPs90syqo'}], u'user_mentions': []}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'low', u'geo': None, u'id': 515406597829693440, u'id_str': u'515406597829693440', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'place': None, u'possibly_sensitive': False, u'retweet_count': 3, u'retweeted': False, u'source': u'<a href="http://www.socialflow.com" rel="nofollow">SocialFlow</a>', u'text': u"EU's anti-terrorism chief tells BBC the number of Europeans joining Islamist fighters in Syria and Iraq now 3,000+ http://t.co/8kPs90syqo", u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Tue Nov 12 10:17:10 +0000 2013', u'default_profile': False, u'default_profile_image': False, u'description': u'Real-time reports from inside the BBC newsroom with @BBCRosAtkins. Combining your sources and ours. BBC World Service radio 10GMT, BBC World News TV 17GMT.', u'favourites_count': 734, u'follow_request_sent': None, u'followers_count': 16500, u'following': None, u'friends_count': 750, u'geo_enabled': False, u'id': 2190056023, u'id_str': u'2190056023', u'is_translator': False, u'lang': u'en-gb', u'listed_count': 340, u'location': u'London', u'name': u'BBC Outside Source', u'notifications': None, u'profile_background_color': u'131516', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme14/bg.gif', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme14/bg.gif', u'profile_background_tile': True, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2190056023/1399384694', u'profile_image_url': u'http://pbs.twimg.com/profile_images/421268342767222784/17yQM0_d_normal.jpeg', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/421268342767222784/17yQM0_d_normal.jpeg', u'profile_link_color': u'009999', u'profile_sidebar_border_color': u'EEEEEE', u'profile_sidebar_fill_color': u'EFEFEF', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'BBCOS', u'statuses_count': 6602, u'time_zone': u'London', u'url': u'http://www.bbc.co.uk/programmes/p01k2bx3', u'utc_offset': 3600, u'verified': True}}, u'source': u'<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>', u'text': u"RT @BBCOS: EU's anti-terrorism chief tells BBC the number of Europeans joining Islamist fighters in Syria and Iraq now 3,000+ http://t.co/8\u2026", u'timestamp_ms': u'1411718498766', u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Sun Nov 13 20:46:00 +0000 2011', u'default_profile': True, u'default_profile_image': False, u'description': u'The more I do nothing, the less time I have to do anything', u'favourites_count': 27, u'follow_request_sent': None, u'followers_count': 216, u'following': None, u'friends_count': 747, u'geo_enabled': True, u'id': 411752020, u'id_str': u'411752020', u'is_translator': False, u'lang': u'en', u'listed_count': 1, u'location': u'', u'name': u'Gerald Quinlan', u'notifications': None, u'profile_background_color': u'C0DEED', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_tile': False, u'profile_image_url': u'http://pbs.twimg.com/profile_images/1661078970/image_normal.jpg', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/1661078970/image_normal.jpg', u'profile_link_color': u'0084B4', u'profile_sidebar_border_color': u'C0DEED', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'QuinlanQuinlan', u'statuses_count': 9223, u'time_zone': None, u'url': None, u'utc_offset': None, u'verified': False}}

Note that some of the lines might be corrupt -- i.e., they do not contain a tweet, but other information, e.g., a logging message. It is your responsibility to deal with corrupt lines and make sure they do not affect your computation.

If tweet_string is the string representation of one tweet, you can load it into a Python dictionary with the following statement.

tweet = ast.literal_eval(tweet_string)

Problem 4 (20 points)

Use PySpark to write and execute queries for the following tasks (1 - 10). Generate one cell per task.

You are free to reuse (possibly persisted) RDDs and other code from earlier tasks.

Tasks

  1. Find out how many lines (tweets or corrupt) there are there in the data.

  2. Inspect any three tweets and infer the tweet schema from them, as completely as you can.

  • Each tweet is identified by its unique id value. Some tweets might appear in more than one lines in the data (i.e., the same tweet id might appear in more than one lines). How many tweets are there that appear in only one line?

  • Each tweet is associated with a lang value, that denotes the language the tweet was written in. How many different languages are there in the dataset?

  • How many lines are there in the data for each language? (Ignore tweet ids for this task).

  • How many tweets are there in the data for the english language (lang = 'en')?

  • Each tweet is associated with a text value that stores the content of the message. What is the minimum and maximum message length in the data? Note: you should compute both values in a single pass over the data.

  • Consider the text message that appears in a tweet. We define a word to be a maximal sequence of alphanumeric characters found in the text, after the text has been converted to lowercase. For example, if the text is

    My username is spark123 & my password is spar!.Kk

    then the words contained in it are the following.

    my, username, is, spark123, my, password, is, spar, kk

    Find the 1000 most frequent words in the text messages of all english tweets (lang = 'en') along with the number of their occurances.

  • Find the 100 most frequent words in english tweets (lang = 'en') that start with each character of the latin alphabet ('a', 'b', ..., 'z'). Use the lowercase version of messages.

  • Each tweet is associated with a user who generated it; and the user is associated with a unique screen_name. Find the screen_name of the $10$ users with the most english tweets (with distinct tweet ids) in the data.

Note Twitter users are also associated with an id value, different than the tweet id. Make sure you do not confuse the two.

Solutions


In [ ]:
## 1

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 2 

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 3

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 4

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 5

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 6

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 7

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 8

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 9

# Your code goes here

Use this cell to explain / expand your answer.


In [ ]:
## 10

# Your code goes here

Use this cell to explain / expand your answer.

Problem 5: Pagerank (30 points)

Some tweets are replies to tweets written by other Twitter users. You can tell which tweets are replies by checking the in_reply_to_screen_name value of a tweet: if it is None, then it is not a reply - otherwise, it is a reply to the user with the screen_name mentioned therein.

For example, in_reply_to_screen_name: None signifies that the tweet is not a reply to another tweet, but in_reply_to_screen_name: northern_bytes signifies that the tweet is a reply to another tweet generated by user northern_bytes.

We are going to construct a graph in the following steps:

  1. We place one node in the graph for each user who has produced more than $20$ english tweets (with distinct ids) in the data;
  2. We place one directed edge from node $u$ to node $v$ iff the total number of english replies from user $u$ to user $v$ in the data is more than $10$. In that case, we will say that '$u$ is connected to $v$'.

1. Graph construction

Construct a pair RDD named linksRDD to store the adjacency list of each node. Specifically, for each node (user) $u$ in the graph, you should have one element of the following form

(screen_name, [screen_name_1, screen_name_2, ..., screen_name_v, ...])

where screen_name corresponds to user $u$ and screen_name_v corresponds to a user $v$ that $u$ is connected to.


In [ ]:
# Your code goes here

How many nodes and edges are there in the graph?


In [ ]:
# Your code goes here

2. Pagerank Computation

Implement and employ PageRank on the graph you constructed above, using $10$ iterations and parameter $\alpha = 0.15$. Store the pagerank scores in an RDD named ranksRDD.


In [ ]:
## Pagerank implementation

# Your code goes here

Report the $5$ nodes with highest pagerank values.


In [ ]:
# Your code goes here


Credits

We are grateful to:

  • Apurva Nandan, Olli Tourunen and Aleksi Kallio from CSC, for providing infrastructure and support with setting up Spark clusters.
  • Kiran Garimella, doctoral student at Aalto, for providing the twitter data.

Updates

If the file is updated (e.g., to add a clarification), a short description of the update will be listed here, as well as on mycourses.

  • The big dataset is posted ("hdfs:///moderndb/input_2.5m.input"). To analyze this dataset, you will connect to a bigger cluster than the one you've been using for the small dataset. You do that by providing some additional environment specifications -- see the cell with the os.environ() calls in Section PySpark Setup. The bigger cluster consists of 8 nodes, with 4 cores, 15gb ram, and 230gb disk each.
  • Added link in Section Server Availability (https://86.50.170.119/resources) to resources webpage for the big cluster.
  • Minor: The order of statements in PySpark Setup has changed a little. Also changed the wording at a couple of places ('big' dataset instead of large).

In [ ]: